\input acphdr
\runninglefthead{ARITHMETIC---FIRST PROOFS $\copyright$ 1978}
\runningrighthead{ARITHMETIC---FIRST PROOFS $\copyright$ 1978}
\section{4.x}
\setcount0 461
\tenpoint
%folio 618 galley 9 Unreadable. (C) Addison-Wesley 1978	*
\yskip It is clear that the results we have
obtained about chains for polynomials in a single variable can
be extended without difficulty to multivariate polynomials.
For example, if we want to find an optimum scheme for polynomial
evaluation {\sl without} adaptation of coefficients, we can
regard $u(x)$ as a polynomial in the $n + 2$ variables $x$, $u↓n$,
$\ldotss$, $u↓1$, $u↓0$; exercise 38 shows that $n$ multiplications
and $n$ additions are necessary in this case. Indeed, A. Borodin
[{\sl Theory of Machines and Computations}, ed.\ by Z. Kohavi
and A. Paz (New York: Academic Press, 1971), 45--58] has proved
that Horner's rule (2) is essentially the {\sl only} way to compute
$u(x)$ in $2n$ operations without preconditioning.

\yskip With minor variations, the above methods can be extended to chains
involving division, i.e., to rational functions as well as polynomials.
Curiously, the continued-fraction analog of Horner's rule now
turns out to be optimal from an operation-count standpoint, if
multiplication and division speeds are equal, even when preconditioning
is allowed (see exercise 37).

Sometimes division is helpful during the evaluation of polynomials,
even though polynomials are defined only in terms of multiplication and
addition; we have seen examples of this in the Shaw--Traub algorithms for
polynomial derivatives. Another example is the polynomial $x↑n+\cdots+x+1$:
Since this polynomial
can be written $(x↑{n+1}-1)/(x-1)$, we can evaluate it with $l(n+1)$
multiplications (see Section 4.6.3), two subtractions, and one division, while
techniques that avoid division seem to require about three times as many operations
(see exercise 43).
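
To make the arithmetic concrete, here is a small Python sketch (an editorial illustration, not part of the text; the name geometric_sum is ours) that evaluates $x↑n+\cdots+x+1$ for an integer $x≠1$ by binary powering, followed by the two subtractions and one division mentioned above:

def geometric_sum(x, n):
    # (x^(n+1) - 1)/(x - 1), using square-and-multiply for the power
    e, power, base = n + 1, 1, x
    while e:                               # about 2 lg(n+1) multiplications
        if e & 1:
            power *= base
        base *= base
        e >>= 1
    return (power - 1) // (x - 1)          # exact, since x - 1 divides x^(n+1) - 1

assert geometric_sum(3, 5) == sum(3**k for k in range(6))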

\subsectionbegin{Special multivariate polynomials} The {\sl determinant} of an
$n\times n$ matrix may be considered to be a polynomial in $n↑2$ variables $x↓{ij}$,
$1≤i,j≤n$. If $x↓{11}≠0$, we have
$$\twoline{\hskip-10pt\det\left(\vcenter{\halign{$#\hfill$⊗\quad$#\hfill$⊗$\quad
#\quad$⊗$#\hfill$\cr
x↓{11}⊗x↓{12}⊗\ldots⊗x↓{1n}\cr
x↓{21}⊗x↓{22}⊗\ldots⊗x↓{2n}\cr
\9\vdots⊗\9\vdots⊗⊗\9\vdots\cr
x↓{n1}⊗x↓{n2}⊗\ldots⊗x↓{nn}\cr}}\right)}{8pt}{=x↓{11}\det\left(\vcenter{
\halign{$#\hfill$⊗$\null#\hfill$⊗$\quad#\quad$⊗$#\hfill$⊗$\null#\hfill$\cr
x↓{22}⊗-(x↓{21}/x↓{11})x↓{12}⊗\ldots⊗x↓{2n}⊗-(x↓{21}/x↓{11})x↓{1n}\cr
⊗\quad\vdots⊗⊗⊗\quad\vdots\cr
x↓{n2}⊗-(x↓{n1}/x↓{11})x↓{12}⊗\ldots⊗x↓{nn}⊗-(x↓{n1}/x↓{11})x↓{1n}\cr}}\right).
\qquad(31)\hskip-10pt}$$
The determinant of an $n\times n$ matrix may therefore be evaluated by evaluating
the determinant of an $(n-1)\times(n-1)$ matrix and using an additional
$(n-1)↑2+1$ multiplications, $(n-1)↑2$ additions, and $n-1$ divisions. Since a
$2\times2$ determinant can be evaluated with two multiplications and one
addition, we see that the determinant of almost all matrices (namely those for
which no division by zero is required) can be computed with at most $(2n↑3-3n↑2+
7n-6)/6$ multiplications, $(2n↑3-3n↑2+n)/6$ additions, and $(n↑2-n-2)/2$
divisions.

When zero occurs, the determinant is even easier to compute. For example, if
$x↓{11}=0$ but $x↓{21}≠0$, we have
$$\def\\{\vjust{\baselineskip 3pt\:d\vskip 4pt\hjust{.}\hjust{.}\hjust{.}}}
\baselineskip 9pt
\scriptstyle\quad\det\left(\vcenter{\halign{$\scriptstyle\ctr{#}$⊗$\scriptstyle\quad
\ctr{#}$⊗$\scriptstyle\quad#\quad$⊗$\scriptstyle\ctr{#}$\cr
0⊗x↓{12}⊗.\,.\,.⊗x↓{1n}\cr
x↓{21}⊗x↓{22}⊗.\,.\,.⊗x↓{2n}\cr
x↓{31}⊗x↓{32}⊗.\,.\,.⊗x↓{3n}\cr
\\⊗\\⊗⊗\\\cr
x↓{n1}⊗x↓{n2}⊗.\,.\,.⊗x↓{nn}\cr}}\right)\,\,=\,\,-x↓{21}\det\left(\vcenter{
\halign{$\scriptstyle\ctr{#}$⊗$\scriptstyle\quad#\quad$⊗$\scriptstyle\ctr{#}$\cr
x↓{12}⊗.\,.\,.⊗x↓{1n}\cr
x↓{32}-(x↓{31}/x↓{21})x↓{22}⊗.\,.\,.⊗x↓{3n}-(x↓{31}/x↓{21})x↓{2n}\cr
\\⊗⊗\\\cr
x↓{n2}-(x↓{n1}/x↓{21})x↓{22}⊗.\,.\,.⊗x↓{nn}-(x↓{n1}/x↓{21})x↓{2n}\cr}}\right)
\textstyle.\eqno(32)$$
Here the reduction to an $(n-1)\times(n-1)$ determinant saves $n-1$ of the
multiplications and $n-1$ of the additions used in (31), and this certainly
compensates for the additional bookkeeping required to recognize this case.
Therefore any determinant can be evaluated with roughly ${2\over3}n↑3$ arithmetic
operations (including division); this is remarkable, since it is a polynomial with
$n!$ terms and $n$ variables in each term.

If we want to evaluate the determinant of a matrix with {\sl integer} elements,
the above process appears to be unattractive since it requires rational
arithmetic.\eject\  %preparation for bad page break ahead
However, we can use the method to evaluate the determinant mod $p$,
for any prime $p$, since division mod $p$ is possible (exercise 4.5.2--15). If
this is done for sufficiently many primes $p$, the exact value of the
determinant can be found as explained in Section 4.3.2, since Hadamard's
inequality (4.6.1--25) gives an upper bound on the magnitude.
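
For example, a Python sketch of the modular evaluation (our illustration; the name det_mod is hypothetical) might run as follows, using $x↑{p-2}\bmod p$ for division mod $p$:

def det_mod(matrix, p):
    # determinant of an integer matrix modulo the prime p, by the reduction (31)-(32)
    a = [[entry % p for entry in row] for row in matrix]
    n, sign, det = len(a), 1, 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col]), None)
        if pivot is None:
            return 0                       # an entire column vanished, so the determinant is 0
        if pivot != col:                   # row interchange as in (32) changes the sign
            a[col], a[pivot] = a[pivot], a[col]
            sign = -sign
        det = det * a[col][col] % p
        inv = pow(a[col][col], p - 2, p)   # division mod p (exercise 4.5.2--15)
        for r in range(col + 1, n):
            factor = a[r][col] * inv % p
            for c in range(col, n):
                a[r][c] = (a[r][c] - factor * a[col][c]) % p
    return sign * det % p

Repeating this for sufficiently many primes and combining the residues as in Section 4.3.2 recovers the exact integer determinant.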

\yskip
The coefficients of the {\sl characteristic polynomial\/} $\det(xI-X)$ of an
$n\times n$ matrix $X$ can also be computed in $O(n↑3)$ steps; cf.\ J. H.
Wilkinson, {\sl The Algebraic Eigenvalue Problem} (Oxford: Clarendon Press,
1965), 353--355, 410--411.

\yskip
The {\sl permanent} of a matrix is a polynomial that is very similar to the
determinant; the only difference is that all of its nonzero coefficients are $+1$:
$$\hjust{per}\left(\vcenter{\halign{$#\hfill$⊗$\quad#\quad$⊗$#\hfill$\cr
x↓{11}⊗\ldots⊗x↓{1n}\cr
\9\vdots⊗⊗\9\vdots\cr
x↓{n1}⊗\ldots⊗x↓{nn}\cr}}\right)=\sum x↓{1j↓1}x↓{2j↓2}\ldotsm x↓{nj↓n}\eqno(33)$$
summed over all permutations $j↓1\,j↓2\ldotsm j↓n$ of $\{1,2,\ldotss,n\}$. No way
to evaluate the permanent as efficiently as the determinant is known; exercises 9
and 10 show that substantially less than $n!$ operations will suffice, for large
$n$, but the execution time of all known methods
still grows exponentially with the size of the matrix.
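
For concreteness, here is one classical way to do substantially better than $n!$ operations, namely Ryser's inclusion-exclusion formula $\hjust{per}(X)=(-1)↑n\sum↓S(-1)↑{|S|}\prod↓{1≤i≤n}\sum↓{j\in S}x↓{ij}$, summed over all subsets $S$ of $\{1,\ldotss,n\}$; it uses on the order of $2↑nn↑2$ operations. The Python sketch below is an editorial illustration only (the exercises cited above may proceed differently):

from itertools import combinations

def permanent(x):
    # Ryser's formula: (-1)^n times the sum, over nonempty column subsets S,
    # of (-1)^|S| times the product of the row sums restricted to S
    n = len(x)
    total = 0
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            prod = 1
            for row in x:
                prod *= sum(row[j] for j in cols)
            total += (-1) ** size * prod
    return (-1) ** n * total

assert permanent([[1, 2], [3, 4]]) == 1*4 + 2*3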

\yskip Another fundamental operation involving matrices is, of course, {\sl
matrix multiplication:}\xskip If $X=(x↓{ij})$ is an $m\times n$ matrix, $Y=(y↓{jk})$
is an $n\times s$ matrix, and $Z=(z↓{ik})$ is an $m\times s$ matrix, then
$Z=XY$ means that
$$\chop to 9pt{z↓{ik}=\sum↓{1≤j≤n}x↓{ij\,}y↓{jk},\qquad 1≤i≤m,\qquad 1≤k≤s.}
\eqno(34)$$
This equation may be regarded as the computation of $ms$ simultaneous polynomials 
in $mn+ns$ variables; each polynomial is the ``inner product'' of two $n$-place
vectors. A brute-force calculation would involve $mns$ multiplications and
$ms(n-1)$ additions; but S. Winograd has discovered an ingenious way to trade
about half of the multiplications for additions:
$$\vjust{\halign{$\hfill\dispstyle#\hfill$\cr
z↓{ik}=\sum↓{1≤j≤n/2}(x↓{i,2j}+y↓{\,2j-1,k})(x↓{i,2j-1}+y↓{\,2j,k})-a↓i-b↓k
+c↓{ik};\cr
\noalign{\vskip4pt}
a↓i=\sum↓{1≤j≤n/2}x↓{i,2j\,}x↓{i,2j-1};\qquad
b↓k=\sum↓{1≤j≤n/2}y↓{\,2j-1,k\,}y↓{\,2j,k};\cr
\noalign{\vskip4pt}
c↓{ik}=\left\{\lpile{0,\cr x↓{in}y↓{nk},\cr}\qquad\lpile{n\hjust{ even;}\cr
n\hjust{ odd.}\cr}\right.\cr}}\eqno(35)$$
This scheme uses $\lceil n/2\rceil ms+\lfloor n/2\rfloor(m+s)$ multiplications and
$(n+2)ms+{(\lfloor n/2\rfloor-1)}\*{(ms+m+s)}$ additions; the total number of
operations has increased slightly, but the number of multiplications has
roughly been halved. This construction led many people to look more closely at
the problem of matrix multiplication, and for several years it was conjectured
that $n↑3/2$ multiplications would be necessary to multiply $n\times n$ matrices.
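
The index pattern of (35) may be easier to absorb from a direct transcription into code; the following Python sketch is an editorial illustration (it uses 0-based subscripts, so the $x↓{i,2j}$ of (35) appears as X[i][2*j+1]) and makes no attempt at efficiency:

def winograd_matmul(X, Y):
    # multiply an m x n matrix X by an n x s matrix Y using scheme (35)
    m, n, s = len(X), len(Y), len(Y[0])
    h = n // 2
    a = [sum(X[i][2*j] * X[i][2*j + 1] for j in range(h)) for i in range(m)]
    b = [sum(Y[2*j][k] * Y[2*j + 1][k] for j in range(h)) for k in range(s)]
    Z = [[0] * s for _ in range(m)]
    for i in range(m):
        for k in range(s):
            z = sum((X[i][2*j + 1] + Y[2*j][k]) * (X[i][2*j] + Y[2*j + 1][k])
                    for j in range(h)) - a[i] - b[k]
            if n % 2:                      # the correction term c_ik when n is odd
                z += X[i][n - 1] * Y[n - 1][k]
            Z[i][k] = z
    return Z

assert winograd_matmul([[1, 2, 3], [4, 5, 6]],
                       [[7, 8], [9, 10], [11, 12]]) == [[58, 64], [139, 154]]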
%folio 621 galley 1 (C) Addison-Wesley 1978	*
An even better scheme for large $n$, discovered
by Volker Strassen in 1968, is based on the fact that the product
of $2 \times 2$ matrices can be evaluated with only 7 multiplications,
without relying on the commutativity of multiplication as in
(35). Therefore $2n \times 2n$ matrices can be partitioned into
four $n \times n$ matrices, and the idea can be used recursively
to obtain the product of $2↑k \times 2↑k$ matrices with only
$7↑k$ multiplications instead of $(2↑k)↑3 = 8↑k$. Strassen's
original $2 \times 2$ identity [{\sl Numer.\ Math.\ \bf 13} (1969),
354--356] used 7 multiplications and 18 additions; S. Winograd
later discovered the following more economical formula:
\mathrm bef \mathit hkl
$$\vjust{\halign{\hjust to size{$\dispstyle#$}\cr
\left({a\atop c}\9{b\atop d}\right)
\left({A\atop B}\9{C\atop D}\right)=\hfill\cr\noalign{\vskip6pt}
\hfill\left({aA+bB\atop w+(a{-}c)(D{-}C)-d(A{-}B{-}C{+}D)}\9{w+(c{+}d)(C{-}A)+
(a{+}b{-}c{-}d)\hskip1pt D\atop
w+(a{-}c)(D{-}C)+(c{+}d)(C{-}A)}\right)\,\hjust{(36)}\cr}}$$
\mathrm adf \mathit gjl
where $w = aA - (a{-}c{-}d)(A{-}C{+}D)$. If intermediate
results are appropriately saved, (36) involves 7 multiplications
and only 15 additions; by induction on $k$, we can multiply
$2↑k \times 2↑k$ matrices with $7↑k$ multiplications
and $5(7↑k - 4↑k)$ additions. The total number of operations
needed to multiply $n \times n$ matrices has therefore been
reduced from order $n↑3$ to $O(n↑{\lg7}) = O(n↑{2.8074})$.
A similar reduction applies also to the
evaluation of determinants and matrix inverses; cf.\ J. R. Bunch
and J. E. Hopcroft, {\sl Math.\ Comp.\ \bf 28} (1974), 231--236.
Exercise 59 discusses a further improvement by V. Pan, who discovered in 1978
that the exponent in the running time can be reduced to less than $\lg 7$.
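
The identity (36) is easily checked by machine; in the following Python sketch (an editorial aid, using the same layout of the two factors as in (36)) the seven chain multiplications are computed once and the entries are then assembled with additions only:

def strassen_winograd_2x2(X, Y):
    # 2 x 2 matrix product via (36): 7 multiplications; the text notes that with
    # suitably shared intermediate results only 15 additions are needed
    (a, b), (c, d) = X
    (A, C), (B, D) = Y
    m1 = a * A
    m2 = b * B
    m3 = (a - c - d) * (A - C + D)
    m4 = (a - c) * (D - C)
    m5 = d * (A - B - C + D)
    m6 = (c + d) * (C - A)
    m7 = (a + b - c - d) * D
    w = m1 - m3                            # w = aA - (a-c-d)(A-C+D)
    return [[m1 + m2, w + m6 + m7],
            [w + m4 - m5, w + m4 + m6]]

assert strassen_winograd_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]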

These theoretical results are quite striking, but
from a practical standpoint they are of limited use because
$n$ must be very large before we overcome the effect of additional
bookkeeping costs. Richard Brent [Stanford Computer Science
report CS157 (March, 1970), see also {\sl Numer.\ Math.\ \bf 16}
(1970), 145--156] found that a careful implementation of Winograd's
scheme (35), with appropriate scaling for numerical stability,
became better than the conventional scheme only when $n ≥ 40$,
and it saved 7 percent of the running time when $n = 100$. For
complex arithmetic the situation was somewhat different; (35)
became advantageous for $n > 20$, and saved 18 percent when
$n = 100$. He estimated that Strassen's scheme would not begin
to excel over (35) until $n\approx250$; and such enormous matrices, containing
more than 60,000 entries, rarely occur in practice
(unless they are very sparse, when other techniques apply).

\yskip By contrast, the methods we shall discuss next
are eminently practical and have found wide use. The {\sl finite
Fourier transform} $f$ of a complex-valued function $F$ of $n$
variables, over respective domains of $m↓1$, $\ldotss$, $m↓n$ elements, is
defined by the equation
$$\chop to 18pt{
f(s↓1, \ldotss , s↓n) = \sum ↓{{\scriptstyle 0≤t↓1<m↓1\atop\scriptstyle\cdot\,\cdot
\,\cdot}\atop\scriptstyle 0≤t↓n<m↓n}
\exp\left(2πi\left({s↓1t↓1\over m↓1} +\cdots + {s↓nt↓n\over
m↓n}\right)\right)\,F(t↓1, \ldotss , t↓n)}\eqno (37)$$
for $0 ≤ s↓1 < m↓1$, $\ldotss$, $0 ≤ s↓n < m↓n$; the
name ``transform'' is justified because we can recover the values
$F(t↓1, \ldotss , t↓n)$ from the values $f(s↓1, \ldotss , s↓n)$,
as shown in exercise 13. In the important special case that
all $m↓j = 2$, we have
$$\chop to 12pt{f(s↓1, \ldotss , s↓n) = \sum ↓{0≤t↓1, \ldotss , t↓n≤1} (-1)↑{s↓1t↓1
+\cdots + s↓nt↓n}F(t↓1, \ldotss , t↓n)}\eqno (38)$$
for $0 ≤ s↓1, \ldotss , s↓n ≤ 1$, and this may be
regarded as a simultaneous evaluation of $2↑n$ linear polynomials
in $2↑n$ variables $F(t↓1, \ldotss , t↓n)$. A well-known technique
due to F. Yates [{\sl The Design and Analysis of Factorial Experiments}
(Harpenden: Imperial Bureau of Soil Sciences, 1937)] can be
used to reduce the number of additions implied in (38) from
$2↑n(2↑n - 1)$ to $n2↑n$. Yates's method can be understood by
considering the case $n = 3$: Let $x↓{t↓1t↓2t↓3} = F(t↓1, t↓2,
t↓3)$.
$$\baselineskip10pt
\vjust{\halign to size{$\scriptstyle#\hfill$\tabskip0pt plus 10pt⊗
$\scriptstyle\hfill#\hfill$⊗$\scriptstyle\hfill#\hfill$⊗$\scriptstyle\hfill#\hfill
$\tabskip0pt\cr
\hjust{\:e Given\hskip-5pt}⊗\hjust{\:e First step}⊗\hjust{\:e Second step}⊗
\hjust{\:e Third step}\cr\noalign{\vskip2pt}
x↓{000}⊗ x↓{000} + x↓{001}⊗ x↓{000} + x↓{001}
+ x↓{010} + x↓{011}⊗ x↓{000} + x↓{001} + x↓{010} + x↓{011}
+ x↓{100} + x↓{101} + x↓{110} + x↓{111}\cr
x↓{001}⊗ x↓{010} + x↓{011}⊗ x↓{100} +
x↓{101} + x↓{110} + x↓{111}⊗ x↓{000} - x↓{001} + x↓{010}
- x↓{011} + x↓{100} - x↓{101} + x↓{110} - x↓{111}\cr
x↓{010}⊗ x↓{100} + x↓{101}⊗ x↓{000} -
x↓{001} + x↓{010} - x↓{011}⊗ x↓{000} + x↓{001} - x↓{010}
- x↓{011} + x↓{100} + x↓{101} - x↓{110} - x↓{111}\cr
x↓{011}⊗ x↓{110} + x↓{111}⊗ x↓{100} -
x↓{101} + x↓{110} - x↓{111}⊗ x↓{000} - x↓{001} - x↓{010}
+ x↓{011} + x↓{100} - x↓{101} - x↓{110} + x↓{111}\cr
x↓{100}⊗ x↓{000} - x↓{001}⊗ x↓{000} +
x↓{001} - x↓{010} - x↓{011}⊗ x↓{000} + x↓{001} + x↓{010}
+ x↓{011} - x↓{100} - x↓{101} - x↓{110} - x↓{111}\cr
x↓{101}⊗ x↓{010} - x↓{011}⊗ x↓{100} +
x↓{101} - x↓{110} - x↓{111}⊗ x↓{000} - x↓{001} + x↓{010}
- x↓{011} - x↓{100} + x↓{101} - x↓{110} + x↓{111}\cr
x↓{110}⊗ x↓{100} - x↓{101}⊗ x↓{000} -
x↓{001} - x↓{010} + x↓{011}⊗ x↓{000} + x↓{001} - x↓{010}
- x↓{011} - x↓{100} - x↓{101} + x↓{110} + x↓{111}\cr
x↓{111}⊗ x↓{110} - x↓{111}⊗ x↓{100} -
x↓{101} - x↓{110} + x↓{111}⊗ x↓{000} - x↓{001} - x↓{010}
+ x↓{011} - x↓{100} + x↓{101} + x↓{110} - x↓{111}\cr}}$$
To get from the ``Given'' to the ``First step''
requires four additions and four subtractions; and the interesting
feature of Yates's method is that exactly the same transformation
that takes us from ``Given'' to ``First step'' will take us
from ``First step'' to ``Second step'' and from ``Second step''
to ``Third step.'' In each case we do four additions, then four
subtractions; and after three steps we have the desired Fourier
transform $f(s↓1, s↓2, s↓3)$ in the place originally occupied
by $F(s↓1, s↓2, s↓3)$!
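
In code, the step just described is a short loop; this Python sketch (an editorial illustration) applies Yates's transformation $n$ times to a list of $2↑n$ values stored in the order $x↓{t↓1\ldotsm t↓n}$, using $n2↑n$ additions and subtractions, and the final assertion compares the result with the definition (38):

def yates_transform(values):
    # each pass forms all sums of adjacent pairs, then all differences
    data = list(values)
    half = len(data) // 2                  # len(values) must be a power of 2
    for _ in range(len(values).bit_length() - 1):
        pairs = [(data[2*i], data[2*i + 1]) for i in range(half)]
        data = [u + v for u, v in pairs] + [u - v for u, v in pairs]
    return data

F = [1, 5, -2, 0, 3, 7, 4, -1]             # F(t1,t2,t3), with (t1 t2 t3) as a binary index
direct = [sum((-1) ** bin(s & t).count('1') * F[t] for t in range(8)) for s in range(8)]
assert yates_transform(F) == direct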

This special case is often called the {\sl Walsh
transform} of $2↑n$ data elements, since the corresponding pattern
of signs was studied by J. L. Walsh [{\sl Amer.\ J. Math.\ \bf 45}
(1923), 5--24]. Note that the number of sign changes from left
to right in the ``Third step'' above assumes the respective
values 0, 7, 3, 4, 1, 6, 2, 5. Walsh observed that there will
be exactly 0, 1, $\ldotss$, $2↑n - 1$ sign changes in some order
in the general case, so the coefficients provide discrete approximations
to sine waves with various frequencies.\xskip (See H. F. Harmuth,
{\sl IEEE Spectrum \bf 6}, 11 (Nov. 1969), 82--91, for applications
of this property; and see Section 7.2.1 for further discussion
of the Walsh coefficients.)

Yates's method can be generalized to the evaluation
of any finite Fourier transform, and, in fact, to the evaluation
of any sums that can be written
$$\twoline{\hskip-10pt f(s↓1, s↓2, \ldotss , s↓n) =}{6pt}{\chop to 18pt{\sum ↓
{{\scriptstyle0≤t↓1<m↓1\atop
\scriptstyle\cdot\,\cdot\,\cdot}\atop\scriptstyle0≤t↓n<m↓n}
\hskip-3pt
g↓1(s↓1, s↓2, \ldotss , s↓n, t↓1)g↓2(s↓2, \ldotss , s↓n, t↓2) \ldotsm
g↓n(s↓n, t↓n)F(t↓1, t↓2, \ldotss , t↓n)\quad(39)}\hskip-10pt}$$
for $0≤s↓j<m↓j$, given the functions $g↓j(s↓j, \ldotss , s↓n, t↓j)$.
We may carry out the evaluation step by step as follows:
$$\def\\#1{{\hjust to 45pt{\hskip0pt plus100pt minus 100pt$\scriptstyle#1$
\hskip0pt plus100pt minus 100pt}}}
\eqalignno{f↑{[0]}(t↓1, t↓2, t↓3, \ldotss , t↓n) ⊗= F(t↓1,
t↓2, t↓3, \ldotss , t↓n);\cr
\noalign{\vskip7pt}
f↑{[1]}(s↓n, t↓1, t↓2, \ldotss , t↓{n-1}) ⊗= \sum
↓\\{0≤t↓n<m↓n} g↓n(s↓n, t↓n)f↑{[0]}(t↓1, t↓2, \ldotss , t↓n);\cr
\noalign{\vskip4pt}
f↑{[2]}(s↓{n-1}, s↓n, t↓1, \ldotss , t↓{n-2})
⊗= \sum ↓\\{0≤t↓{n-1}<m↓{n-1}} g↓{n-1}(s↓{n-1}, s↓n, t↓{n-1})f↑{[1]}(s↓n,
t↓1, \ldotss , t↓{n-1});\cr
\noalign{\vskip4pt}
⊗\9\vdots\cr
\noalign{\vskip4pt}
f↑{[n]}(s↓1, s↓2, s↓3, \ldotss , s↓n) ⊗= \sum
↓\\{0≤t↓1<m↓1} g↓1(s↓1, \ldotss , s↓n, t↓1)f↑{[n-1]}(s↓2, s↓3,
\ldotss , s↓n, t↓1);\cr
\noalign{\vskip4pt}
f(s↓1, s↓2, s↓3, \ldotss , s↓n)
⊗= f↑{[n]}(s↓1, s↓2, s↓3, \ldotss , s↓n).⊗(40)\cr}$$
For Yates's method as shown above, $g↓j(s↓j, \ldotss
, s↓n, t↓j) = (-1)↑{s↓jt↓j}$; $f↑{[0]}(t↓1, t↓2,
t↓3)$ represents the ``Given''; $f↑{[1]}(s↓3, t↓1, t↓2)$ represents
the ``First step''; etc. Whenever a sum can be put into the
form of (39), for reasonably simple functions $g↓j(s↓j, \ldotss
, s↓n, t↓j)$, the scheme (40) will reduce the amount of computation
from order $N↑2$ to order $N \log N$ or thereabouts, where $N =
m↓1 \ldotsm m↓{n\,}$; furthermore this scheme is ideally suited
to parallel computation. The important special case of one-dimensional
Fourier transforms is discussed in exercises 14 and 53; we have
considered the one-dimensional case also in Section 4.3.3.
%folio 624 galley 2a (C) Addison-Wesley 1978	*
\yskip
Let us consider one more special case of polynomial
evaluation. {\sl Lagrange's interpolation polynomial\/} of order
$n$, which we shall write as
$$\eqalignno{u↑{[n]}(x) ⊗= y↓0 {(x - x↓1)(x - x↓2) \ldotsm (x
- x↓n)\over (x↓0 - x↓1)(x↓0 - x↓2) \ldotsm (x↓0 - x↓n)}\cr\noalign{\vskip4pt}
⊗\qquad\null + y↓1 {(x - x↓0)(x - x↓2) \ldotsm (x - x↓n)\over (x↓1
- x↓0)(x↓1 - x↓2) \ldotsm (x↓1 - x↓n)} + \cdots \cr\noalign{\vskip4pt}
⊗\qquad\null+y↓n {(x - x↓0)(x - x↓1) \ldotsm (x - x↓{n-1})\over
(x↓n - x↓0)(x↓n - x↓1) \ldotsm (x↓n - x↓{n-1})} ,⊗(41)\cr}$$
is the only polynomial of degree $≤n$ in $x$ that takes
on the respective values $y↓0$, $y↓1$, $\ldotss$, $y↓n$ at the $n
+ 1$ distinct points $x = x↓0$, $x↓1$, $\ldotss$, $x↓n$.\xskip$\biglp$For it is
evident from (41) that $u↑{[n]}(x↓k) = y↓k$ for $0 ≤ k ≤ n$.
If $f(x)$ is any such polynomial of degree $≤n$, then $g(x)
= f(x) - u↑{[n]}(x)$ is of degree $≤n$, and $g(x)$ is zero for
$x = x↓0$, $x↓1$, $\ldotss$, $x↓n$; therefore $g(x)$ is a multiple
of the polynomial $(x - x↓0)(x - x↓1) \ldotsm (x - x↓n)$. The
degree of the latter polynomial is greater than $n$, so $g(x)
= 0$.$\bigrp$\xskip If we assume that the values of a function in some table
are well approximated by a polynomial, Lagrange's formula (41)
may therefore be used to ``interpolate'' for values of the
function at points $x$ not appearing in the table. Unfortunately,
there seem to be quite a few additions, subtractions, multiplications,
and divisions in Lagrange's formula; in fact, there are $n$
additions, $2n↑2 + 2$ subtractions, $2n↑2 + n - 1$ multiplications,
and $n + 1$ divisions. Fortunately (as we might suspect), improvement
is possible.

The basic idea for simplifying (41) is to note that
$u↑{[n]}(x) - u↑{[n-1]}(x)$ is zero for $x = x↓0$, $\ldotss$, $x↓{n-1}$;
thus $u↑{[n]}(x) - u↑{[n-1]}(x)$ is a polynomial of degree $≤n$
and a multiple of $(x - x↓0) \ldotsm (x - x↓{n-1})$. We conclude
that $u↑{[n]}(x) = {α↓n(x - x↓0) \ldotsm (x - x↓{n-1})} + u↑{[n-1]}(x)$,
where $α↓n$ is a constant. This leads us to {\sl Newton's interpolation
formula}$$\baselineskip14pt\eqalignno{u↑{[n]}(x)⊗=α↓n(x-x↓0)(x-x↓1)\ldotsm
(x - x↓{n-1}) + \cdots \cr ⊗\qquad\null+ α↓2(x - x↓0)(x
- x↓1) + α↓1(x - x↓0) + α↓0,⊗(42)\cr}$$
where the $α$'s are some constants we should like to
determine from $x↓0$, $x↓1$, $\ldotss$, $x↓n$, $y↓0$, $y↓1$, $\ldotss$, $y↓n$.
Note that this formula holds for all $n$; the coefficient $α↓k$ does not depend
on $x↓{k+1}$, $\ldotss$, $x↓n$, $y↓{k+1}$, $\ldotss$, or $y↓n$. Once the
$α$'s are known, Newton's interpolation formula is convenient
for calculation, since we may generalize Horner's rule once
again and write
$$\qquad u↑{[n]}(x) = \biglp (\ldotsm (α↓n(x{-}x↓{n-1}) + α↓{n-1})(x
{-}x↓{n-2}) +\cdotss)(x{-}x↓0) + α↓0\bigrp.\eqno(43)$$
This requires $n$ multiplications and $2n$ additions.
Alternatively, we may evaluate each of the individual terms of
(42) from right to left; with $2n - 1$ multiplications and $2n$
additions we thereby calculate all of the values $u↑{[0]}(x)$,
$u↑{[1]}(x)$, $\ldotss$, $u↑{[n]}(x)$, and this indicates whether
or not an interpolation process is ``converging.''
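
A Python transcription of rule (43) may be helpful (an editorial sketch; the name newton_eval is ours, and the coefficients and points are passed as lists):

def newton_eval(alpha, points, x):
    # rule (43): n multiplications and 2n additions, given alpha_0..alpha_n and x_0..x_{n-1}
    result = alpha[-1]
    for k in range(len(alpha) - 2, -1, -1):
        result = result * (x - points[k]) + alpha[k]
    return result

# e.g., alpha = [1, 0, 0.5] and points = [0, 1] represent 0.5 x(x-1) + 1
assert newton_eval([1, 0, 0.5], [0, 1], 3) == 0.5 * 3 * 2 + 1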

The coefficients $α↓k$ in Newton's formula may be found by computing
the {\sl divided differences} in the following tableau (shown
for $n = 3$):
$$\def\\{\noalign{\vskip-4pt}}
\vjust{\halign to 300pt{$#\hfill$\tabskip0pt plus 10pt
⊗$#\hfill$⊗$#\hfill$⊗$#\hfill$\tabskip0pt\cr
y↓0\cr\\
⊗(y↓1\!-\!y↓0)/(x↓1\!-\!x↓0) = y↑\prime↓{1}\cr\\
y↓1⊗⊗(y↑\prime↓{2}\!-\!y↑\prime↓{1})/(x↓2\!-\!x↓0) = y↓2↑{\prime\prime}\cr\\
⊗(y↓2\!-\!y↓1)/(x↓2\!-\!x↓1) = y↑\prime↓{2}⊗
⊗(y↓3\!-\!y↓2)/(x↓3\!-\!x↓0) = y↓3↑{\prime\prime\prime}\hskip-48pt\cr\\
y↓2⊗⊗(y↑\prime↓{3}\!-\!y↑\prime↓{2})/(x↓3\!-\!x↓1) = y↓3↑{\prime\prime}\cr\\
⊗(y↓3\!-\!y↓2)/(x↓3\!-\!x↓2) = y↓3↑\prime\cr\\
y↓3\cr}}\eqno(44)$$
It is possible to prove that
$α↓0 = y↓0$, $α↓1 = y↑\prime↓{1}$, $α↓2=y↓2↑{\prime\prime}$, etc., and to show that
the divided differences have important relations to the derivatives
of the function being interpolated; see exercise 15. Therefore
the following calculation $\biglp$corresponding to (44)$\bigrp$
may be used to obtain the $α$'s:
$$\hjust to 310pt{Start with $(α↓0, α↓1, \ldotss , α↓n) ← (y↓0, y↓1, \ldotss
, y↓n)$; then, for $k = 1$, 2, $\ldotss$, $n$
(in this order), set $α↓j ← (α↓j - α↓{j-1})/(x↓j - x↓{j-k})$
for $j = n$, $n - 1$, $\ldotss$, $k$ (in this order).}$$
This process requires ${1\over 2}(n↑2 + n)$
divisions and $n↑2 + n$ subtractions, so about three-fourths
of the work implied in (41) has been saved.
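
The in-place rule just stated goes over into Python almost verbatim; this editorial sketch uses exact rational arithmetic (the standard fractions module) so that tableaux such as (44) are reproduced exactly, and its final assertion anticipates the factorial example worked out next:

from fractions import Fraction

def divided_differences(xs, ys):
    # returns alpha_0, ..., alpha_n for Newton's formula (42)
    alpha = [Fraction(y) for y in ys]
    n = len(alpha) - 1
    for k in range(1, n + 1):
        for j in range(n, k - 1, -1):      # j = n, n-1, ..., k, in this order
            alpha[j] = (alpha[j] - alpha[j - 1]) / (xs[j] - xs[j - k])
    return alpha

assert divided_differences([0, 1, 2, 3], [1, 1, 2, 6]) == [1, 0, Fraction(1, 2), Fraction(1, 3)]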

\eject % preparation for bad page break ahead
For example, suppose that we want to estimate ${3\over
2}$! from the values of $0!$, $1!$, $2!$, and $3!$, using a cubic polynomial.
The divided differences are
$$\def\\{\noalign{\vskip-4pt}}
\vjust{\halign{$\ctr{#}$⊗\quad#\quad\hfill⊗$\ctr{#}$⊗\quad$\ctr{#}$⊗\quad
$\ctr{#}$⊗\quad$\ctr{#}$\cr
x⊗\chop to 0pt{\hjust{\vrule height 9pt depth 66pt}}⊗y⊗y↑\prime⊗
y↑{\prime\prime}⊗y↑{\prime\prime\prime}\cr
\noalign{\vskip4pt}
0⊗⊗1\cr\\
⊗⊗⊗0\cr\\
1⊗⊗1⊗⊗{1\over 2}\cr\\
⊗⊗⊗1⊗⊗{1\over 3}\cr\\
2⊗⊗2⊗⊗{3\over 2}\cr\\
⊗⊗⊗4\cr\\
3⊗⊗6\cr}}$$
so $u↑{[0]}(x) = u↑{[1]}(x) = 1$, $u↑{[2]}(x)
= {1\over 2}x(x - 1) + 1$, $u↑{[3]}(x) = {1\over 3}x(x - 1)(x
- 2) + {1\over 2}x(x - 1) + 1$. Setting $x = {3\over 2}$ in the latter polynomial
gives $-{1\over 8} + {3\over8} + 1 = 1.25$; presumably the ``correct''
value is $\Gamma ({5\over 2}) = {3\over 4}\sqrtπ \approx
1.33$.

It is instructive to note that evaluation of the
interpolation polynomial is just a special case of the Chinese
remainder algorithm of Section 4.3.2, since we know the values
of $u↑{[n]}(x)$ modulo the relatively prime polynomials $x - x↓0$,
$\ldotss$, $x - x↓n$.\xskip$\biglp$As we have seen in Section 4.6.2, $f(x)$ mod
$(x - x↓0) = f(x↓0)$.$\bigrp$\xskip Under this interpretation,
Newton's formula (42) is precisely the ``mixed-radix representation''
of Eq.\ 4.3.2--24; and 4.3.2--23 yields another way to compute
$α↓0$, $\ldotss$, $α↓n$ using the same number of operations as (44).

\yskip By applying fast Fourier transforms, it is possible
to reduce the running time for interpolation to $O\biglp n\,(\log
n)↑2\bigrp$, and a similar reduction can also be made for related
algorithms such as the solution to the Chinese remainder problem
and the evaluation of an $n$th degree polynomial at $n$ different
points.\xskip[See E. Horowitz, {\sl Inf.\
Proc.\ Letters \bf 1} (1972), 157--163; R. Moenck and A. Borodin,
{\sl J. Comp.\ Syst.\ Sci.\ \bf 8} (1974), 336--385; and A. Borodin,
{\sl Complexity of Sequential and Parallel Numerical Algorithms},
ed.\ by J. F. Traub (New York: Academic Press, 1973), 149--180.]\xskip
However, this must be regarded as a purely theoretical possibility
at present, since the known algorithms have a rather large overhead
factor that makes them unattractive unless $n$ is quite large.

\yskip A remarkable modification of the method of divided
differences, an extension that applies to rational functions instead of to
polynomials, was introduced by T. N. Thiele in 1909. For a discussion
of Thiele's method of ``reciprocal differences,'' see L. M.
Milne-Thomson, {\sl Calculus of Finite Differences} (London:
Macmillan, 1933), Chapter 5; R. W. Floyd, {\sl CACM \bf 3} (1960),
508.

%New material 1 [1] (C) Addison-Wesley 1978	*
\subsectionbegin{\star Bilinear forms} Several of the problems we have considered in
this section are special cases of the general problem of evaluating a set of
{\sl bilinear forms}
$$z↓k=\chop to 12pt{\sum↓{\scriptstyle 1≤i≤m\atop\scriptstyle 1≤j≤n}
a↓{ijk}x↓iy↓j,\qquad\hjust{for }1≤k≤s,}\eqno(45)$$
where the $a↓{ijk}$ are specific coefficients belonging to some given field. The
three-dimensional array $(a↓{ijk})$ is called an $m\times n\times s$ {\sl tensor},
and we can display it by writing down $s$ matrices of size $m\times n$, one for each
value of $k$. For example, the
problem of multiplying complex numbers, namely the problem of evaluating
$$z↓1+iz↓2=(x↓1+ix↓2)(y↓1+iy↓2)=(x↓1y↓1{-}x↓2y↓2)+i(x↓1y↓2{+}x↓2y↓1),\eqno(46)$$
is the problem of computing the bilinear form specified by the $2\times2\times2$
tensor
$$\left({1\atop0}\9{0\atop-1}\right)\quad\left({0\atop1}\9{1\atop0}\right).$$
Matrix multiplication as defined in (34) is the problem of evaluating a set of
bilinear forms corresponding to a particular $mn\times ns\times ms$ tensor.
Fourier transforms (37) can also be cast in this mold, if we let the $x$'s be
constant rather than variable.

The evaluation of bilinear forms is most easily studied if we restrict ourselves
to what might be called {\sl normal} evaluation schemes, in which all chain
multiplications take place between a linear combination of the $x$'s and
a linear combination of the $y$'s. Thus, we form $t$ products
$$\qquad w↓l=(α↓{1l\,}x↓1+\cdots+α↓{ml\,}x↓m)(β↓{1l\,}y↓1+\cdots+β↓{nl\,}y↓n)\qquad
\hjust{for }1≤l≤t,\eqno(47)$$
and obtain the $z$'s as linear combinations of these products,
$$z↓k=\gamma↓{k1}w↓1+\cdots+\gamma↓{kt\,}w↓t,\qquad\hjust{for }1≤k≤s.\eqno(48)$$
Here all the $α$'s, $β$'s, and $\gamma$'s belong to a given field of
coefficients. By comparing (48) to (45), we see that a normal evaluation scheme
is correct for the tensor $(a↓{ijk})$ if and only if
$$a↓{ijk}=α↓{i1}β↓{j1}\gamma↓{k1}+\cdots+α↓{it\,}β↓{jt\,}\gamma↓{kt\,}\eqno(49)$$
for $1≤i≤m$, $1≤j≤n$, and $1≤k≤s$.

A nonzero tensor $(a↓{ijk})$ is said to be of rank one if there are three vectors
$(α↓1,\ldotss,α↓m)$, $(β↓1,\ldotss,β↓n)$, $(\gamma↓1,\ldotss,\gamma↓s)$ such
that $a↓{ijk}=α↓iβ↓j\hskip1pt\gamma↓k$ for all $i$, $j$, $k$. We can extend this
definition to all tensors by saying that {\sl the rank of $(a↓{ijk})$ is the
minimum number $t$ such that $(a↓{ijk})$ is expressible as the sum of\/\ $t$
rank-one tensors} in the given field.
Comparing this definition with Eq.\ (49) shows that the rank of a tensor is
the minimum number of chain multiplications in a normal evaluation of the
corresponding bilinear forms. Incidentally, when $s=1$ the tensor $(a↓{ijk})$
is just an ordinary matrix, and the rank of $(a↓{ij1})$ as a tensor is the
same as its rank as a matrix (see exercise 49). The concept of tensor rank was
introduced by F. L. Hitchcock in {\sl J. Math.\ and Physics \bf6} (1927),
164--189; its application to the complexity of polynomial evaluation was pointed
out in an important paper
by V. Strassen, {\sl J. f\"ur die reine und angew.\ Math.\ \bf264}
(1973), 184--202.

Winograd's scheme (35) for matrix multiplication is ``abnormal'' because it
mixes $x$'s and $y$'s before multiplying them. The Strassen-Winograd scheme
(36), on the other hand, does not rely on the commutativity of multiplication,
so it is normal. In fact, (36) corresponds to the following way to represent
the $4\times4\times4$ tensor for $2\times2$ matrix multiplication as a sum of
seven rank-one tensors ($\=1=-1$):
$$\def\\#1.#2.#3.#4.{\left(\vcenter{\baselineskip 9pt
\halign{\hjust to 24pt{\:c##}\cr#1\cr#2\cr#3\cr#4\cr}}\right)}
\vjust{\lineskip6pt\halign{$#\hfill$\cr
\\1 0 0 0.0 1 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.1 0 0 0.0 1 0 0.
\\0 0 1 0.0 0 0 1.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 1 0.0 0 0 1.=
\\1 0 0 0.1 0 0 0.1 0 0 0.1 0 0 0.
\\1 0 0 0.1 0 0 0.1 0 0 0.1 0 0 0.
\\1 0 0 0.1 0 0 0.1 0 0 0.1 0 0 0.
\\1 0 0 0.1 0 0 0.1 0 0 0.1 0 0 0.\cr
\quad\null+
\\0 0 0 0.0 1 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.+
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 \=1 \=1.0 0 0 0.0 0 1 1.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 \=1 \=1.0 0 0 0.0 0 1 1.0 0 0 0.\cr
\quad\null+
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.1 1 \=1 \=1.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.+
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.\=1 0 1 0.1 0 \=1 0.
\\0 0 0 0.0 0 0 0.\=1 0 1 0.1 0 \=1 0.\cr
\quad\null+
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\0 0 0 \=1.0 0 0 1.0 0 0 1.0 0 0 \=1.
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.+
\\0 0 0 0.0 0 0 0.0 0 0 0.0 0 0 0.
\\\=1 0 1 1.0 0 0 0.1 0 \=1 \=1.\=1 0 1 1.
\\\=1 0 1 1.0 0 0 0.1 0 \=1 \=1.\=1 0 1 1.
\\\=1 0 1 1.0 0 0 0.1 0 \=1 \=1.\=1 0 1 1..\cr}}
\eqno(50)\lower5pt\vjust to 19pt{}$$
%New material 2 [3] (C) Addison-Wesley 1978	*
\def\midlp{\mathopen{\vcenter{\hjust{\:@\char'20}}}}
\def\midrp{\mathclose{\vcenter{\hjust{\:@\char'21}}}}
The fact that (49) is symmetric in $i$, $j$, $k$ and invariant under a variety
of transformations makes the study of tensor rank mathematically tractable,
and it also leads to some surprising consequences about bilinear forms. We
can permute the indices $i$, $j$, $k$ to obtain ``transposed'' bilinear forms,
and the transposed tensor clearly has the same rank; but the corresponding
bilinear forms are conceptually quite different. For example, a normal scheme
for evaluating an $(m\times n)$ times $(n\times s)$ matrix product implies the
existence of a normal scheme to evaluate an $(n\times s)$ times $(s\times m)$
matrix product, using the same number of chain multiplications.  In matrix terms
these two problems hardly seem to be related at all---they involve different
numbers of dot products on vectors of different sizes---but in tensor terms they
are equivalent.\xskip(Cf.\ J. E. Hopcroft and J. Musinski, {\sl SIAM J.
Computing \bf2} (1973), 159--173.)

When the tensor $(a↓{ijk})$ can be represented as a sum (49) of $t$ rank-one
tensors, let $A$, $B$, $C$ be the matrices $(α↓{il})$, $(β↓{jl})$, $(\gamma↓{kl})$
of respective sizes $m\times t$, $n\times t$, $s\times t$; we shall say that
$A$, $B$, $C$ is a {\sl realization} of the tensor $(a↓{ijk})$. For example, the
realization of $2\times2$ matrix multiplication in (50) can be specified by the
matrices
$$\quad A=\left(\vcenter{\halign{#\cr
1 0 \=1 0 0 \=1 1\cr 0 1 0 0 0 1 0\cr 0 0 1 0 \=1 1 \=1\cr 0 0 0 1 1 \=1 1\cr
}}\right),\quad B=\left(\vcenter{\halign{#\cr
1 0 0 1 1 0 \=1\cr 0 1 0 1 0 0 0\cr 0 0 1 \=1 \=1 0 1\cr 0 0 1 \=1 0 1 1\cr
}}\right),\quad C=\left(\vcenter{\halign{#\cr
1 1 0 0 0 0 0\cr 1 0 1 1 0 0 1\cr 1 0 0 0 1 1 1\cr 1 0 1 0 1 0 1\cr}}\right).
\eqno(51)$$
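
Equation (49) makes such a claim mechanically checkable. The following Python sketch (an editorial verification, not part of the text) rebuilds the $4\times4\times4$ tensor on the left of (50), with $x=(a,b,c,d)$ and $y=(A,B,C,D)$ in the order used there and with $z↓1=aA{+}bB$, $z↓2=cA{+}dB$, $z↓3=aC{+}bD$, $z↓4=cC{+}dD$, and confirms that the matrices (51) satisfy (49):

A = [[1, 0, -1, 0, 0, -1, 1],
     [0, 1, 0, 0, 0, 1, 0],
     [0, 0, 1, 0, -1, 1, -1],
     [0, 0, 0, 1, 1, -1, 1]]
B = [[1, 0, 0, 1, 1, 0, -1],
     [0, 1, 0, 1, 0, 0, 0],
     [0, 0, 1, -1, -1, 0, 1],
     [0, 0, 1, -1, 0, 1, 1]]
C = [[1, 1, 0, 0, 0, 0, 0],
     [1, 0, 1, 1, 0, 0, 1],
     [1, 0, 0, 0, 1, 1, 1],
     [1, 0, 1, 0, 1, 0, 1]]

tensor = [[[0] * 4 for _ in range(4)] for _ in range(4)]      # tensor[i][j][k] = a_ijk
for k, terms in enumerate([[(0, 0), (1, 1)],                  # z_1 = x_1 y_1 + x_2 y_2
                           [(2, 0), (3, 1)],                  # z_2 = x_3 y_1 + x_4 y_2
                           [(0, 2), (1, 3)],                  # z_3 = x_1 y_3 + x_2 y_4
                           [(2, 2), (3, 3)]]):                # z_4 = x_3 y_3 + x_4 y_4
    for i, j in terms:
        tensor[i][j][k] = 1

assert all(tensor[i][j][k] == sum(A[i][l] * B[j][l] * C[k][l] for l in range(7))
           for i in range(4) for j in range(4) for k in range(4))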

An $m\times n\times s$ tensor $(a↓{ijk})$ can also be represented as a matrix by
grouping its subscripts together. We shall write $(a↓{(ij)k})$ for the $mn\times s$
matrix whose rows are indexed by the pair of subscripts $\langle i,j\rangle$ and
whose columns are indexed by $k$. Similarly, $(a↓{k(ij)})$ stands for the $s\times
mn$ matrix that contains $a↓{ijk}$ in row $k$ and column $\langle i,j\rangle$;\xskip
$(a↓{(ik)j})$ is an $ms\times n$ matrix, and so on. The indices of an array need not
be integers, and we are using ordered pairs as indices here. We can use this
notation to derive the following simple but useful lower bound on the rank of a
tensor.

\def\\{\hjust{\rm rank}}
\thbegin Lemma T. {\sl Let $A$, $B$, $C$ be a realization of an $m\times n\times s$
tensor $(a↓{ijk})$. Then $\\(A)≥\\(a↓{i(jk)})$, $\\(B)≥\\(a↓{j(ik)})$, and
$\\(C)≥\\(a↓{k(ij)})$; consequently}
$$\\(a↓{ijk})≥\max\biglp\\(a↓{i(jk)}),\\(a↓{j(ik)}),\\(a↓{k(ij)})\bigrp.$$

\dproofbegin It suffices by symmetry to show that $t≥\\(A)≥\\(a↓{i(jk)})$. Since
$A$ is an $m\times t$ matrix, it is obvious that $A$ cannot have rank greater than
$t$. Furthermore, according to (49), the matrix $(a↓{i(jk)})$ is equal to $AQ$,
where $Q$ is the $t\times ns$ matrix defined by $Q↓{l\langle j,k\rangle}=β↓{jl\,}
\gamma↓{kl\,}$. If $x$ is any row vector such that $xA=0$ then $xAQ=0$, hence all
linear dependencies in $A$ occur also in $AQ$. It follows that $\\(AQ)≤\\(A)$.\quad
\blackslug

\yyskip As an example of the use of Lemma T\null, let us consider the problem of
polynomial multiplication. Suppose we want to multiply a general polynomial
of degree 2 by a general polynomial of degree 3, obtaining the coefficients of
the product:
$$\twoline{\hskip-10pt(x↓0+x↓1u+x↓2u↑2)(y↓0+y↓1u+y↓2u↑2+y↓3u↑3)}{3pt}{=z↓0
+z↓1u+z↓2u↑2+z↓3u↑3+z↓4u↑4+z↓5u↑5.\qquad(52)\hskip-10pt}$$
This is the problem of evaluating six bilinear forms corresponding to the
$3\times4\times6$ tensor
$$\def\\#1.#2.#3.{\left(\vcenter{\baselineskip 10pt
\halign{##\cr#1\cr#2\cr#3\cr}}\right)}
\\1 0 0 0.0 0 0 0.0 0 0 0.
\\0 1 0 0.1 0 0 0.0 0 0 0.
\\0 0 1 0.0 1 0 0.1 0 0 0.
\\0 0 0 1.0 0 1 0.0 1 0 0.
\\0 0 0 0.0 0 0 1.0 0 1 0.
\\0 0 0 0.0 0 0 0.0 0 0 1..\eqno(53)$$
For brevity, we may write (52) as $x(u)y(u)=z(u)$, letting $x(u)$ denote the
polynomial $x↓0+x↓1u+x↓2u↑2$, etc. Note that we have come full circle from the
way we began this section, since Eq.\ (1) refers to $u(x)$, not $x(u)$; the
notation has changed because the {\sl coefficients} of the polynomials are now
the variables of interest to us.

If each of the six matrices in (53) is regarded as a vector of length 12 indexed
by $\langle i,j\rangle$, it is clear that the vectors are linearly independent,
since they are nonzero in different positions; hence the rank of (53) is at
least 6 by Lemma T\null. Conversely, it is possible to obtain the coefficients
$z↓0$, $z↓1$, $\ldotss$, $z↓5$ by making only six chain multiplications,
for example by computing
$$x(0)y(0),\;x(1)y(1),\;\ldotss,\;x(5)y(5);\eqno(54)$$
this gives the values of $z(0)$, $z(1)$, $\ldotss$, $z(5)$, and the formulas
developed above for interpolation will yield the coefficients of $z(u)$. The
evaluation of $x(j)$ and $y(j)$ can be carried out entirely in terms of additions
and/or parameter multiplications, and the interpolation formula merely takes linear
combinations of these values. Thus, all of the chain multiplications are shown in
(54), and the rank of (53) is 6.\xskip(We used essentially this same technique when
multiplying high-precision numbers in Algorithm 4.3.3C.)
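
As a sketch in code (ours, using exact rational arithmetic so that the interpolation comes out exactly), the whole scheme (54) can be written as follows; the chain multiplications are precisely the six products of values, everything else being additions, subtractions, and multiplications or divisions by the fixed interpolation points:

from fractions import Fraction

def evaluate(p, t):
    return sum(c * t ** i for i, c in enumerate(p))

def multiply_by_evaluation(x, y):
    # coefficients of x(u)y(u), found from the values x(0)y(0), ..., x(deg)y(deg)
    deg = len(x) + len(y) - 2
    pts = list(range(deg + 1))
    alpha = [Fraction(evaluate(x, t) * evaluate(y, t)) for t in pts]
    for k in range(1, deg + 1):                # divided differences, as in (44)
        for j in range(deg, k - 1, -1):
            alpha[j] = (alpha[j] - alpha[j - 1]) / (pts[j] - pts[j - k])
    coeffs = [alpha[deg]]                      # expand Newton's form (42) into powers of u
    for k in range(deg - 1, -1, -1):
        t = pts[k]
        new = [-t * c for c in coeffs] + [Fraction(0)]
        for i, c in enumerate(coeffs):
            new[i + 1] += c
        new[0] += alpha[k]
        coeffs = new
    return coeffs

assert multiply_by_evaluation([3, -1, 4], [2, 7, 0, -5]) == [6, 19, 1, 13, 5, -20]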

The realization $A$, $B$, $C$ of (53) sketched above turns out to be
$$\def\\#1{\left(\vcenter{\mathrm ccc\mathsy www\baselineskip 9pt
\def\q{\hskip4pt plus1000000000pt}
\halign{$\hfill##$⊗$\q##$⊗$\q##$⊗$\q##$⊗$\q##$⊗$\q##$\cr#1}}\right)}
\hjust to size{$\\{1⊗1⊗1⊗1⊗1⊗1\cr 0⊗1⊗2⊗3⊗4⊗5\cr 0⊗1⊗4⊗9⊗16⊗25\cr},\hfill
\\{1⊗1⊗1⊗1⊗1⊗1\cr 0⊗1⊗2⊗3⊗4⊗5\cr 0⊗1⊗4⊗9⊗16⊗25\cr 0⊗1⊗8⊗27⊗64⊗125\cr},\hfill
\\{120⊗0⊗0⊗0⊗0⊗0\cr
-274⊗600⊗-600⊗400⊗-150⊗24\cr
225⊗-770⊗1070⊗-780⊗305⊗-50\cr
-85⊗355⊗-590⊗490⊗-205⊗35\cr
15⊗-70⊗130⊗-120⊗55⊗-10\cr
-1⊗5⊗-10⊗10⊗-5⊗1\cr}{\times}{1\over120}$.}\eqno(55)$$
Thus, the scheme does indeed require the minimum number of chain multiplications,
but it is completely impractical because it involves so many additions and parameter
multiplications. We shall now study a practical approach to the generation of more
efficient schemes, suggested by S. Winograd.

In the first place, to evaluate the coefficients of $x(u)y(u)$ when deg$(x)=m$ and
deg$(y)=n$, one can use the identity
$$x(u)y(u)=\biglp x(u)y(u)\mod p(u)\bigrp+x↓my↓np(u),\eqno(56)$$
where $p(u)$ is any monic polynomial of degree $m+n$. The polynomial $p(u)$
should be chosen so that the coefficients of $x(u)y(u)\mod p(u)$ are
easy to evaluate.

In the second place, to evaluate the coefficients of $x(u)y(u)\mod p(u)$, when the
polynomial $p(u)$ can be factored into $q(u)r(u)$ where $\gcd\biglp q(u),r(u)\bigrp
=1$, one can use the identity
$$\twoline{\hskip-10pt
x(u)y(u)\mod q(u)r(u)=\midlp a(u)r(u)\biglp x(u)y(u)\mod q(u)
\bigrp}{2pt}{\null+b(u)q(u)\biglp x(u)y(u)\mod r(u)\bigrp\midrp\mod q(u)r(u)\qquad
(57)\hskip-10pt}$$
where $a(u)r(u)+b(u)q(u)=1$; this is essentially the Chinese remainder theorem
applied to polynomials. 

In the third place, to evaluate the coefficients of
$x(u)y(u)\mod p(u)$ when $p(u)$ has only one irreducible factor over the
field of coefficients, one can use the identity
$$x(u)y(u)\mod p(u)=\biglp x(u)\mod p(u)\bigrp\biglp y(u)\mod p(u)\bigrp\mod p(u).
\eqno(58)$$
Repeated application of (56), (57), and (58) tends to produce efficient schemes,
as we shall see.

For our example problem (52), let us choose $p(u)=u↑5-u$ and apply (56); the
reason for this choice of $p(u)$ will appear as we proceed. Writing $p(u)=
u(u↑4-1)$, rule (57) reduces to
$$\twoline{\hskip-10pt x(u)y(u)\mod u(u↑4-1)=\biglp-(u↑4-1)x↓0y↓0}{3pt}{\null
+u↑4\biglp x(u)y(u)\mod(u↑4-1)\bigrp\bigrp\mod(u↑5-u).\qquad(59)\hskip-10pt}$$
Here we have used the fact that $x(u)y(u)\mod u=x↓0y↓0$; in general it is a
good idea to choose $p(u)$ in such a way that $p(0)=0$, so that this simplification
can be used. If we could now determine the coefficients $w↓0$, $w↓1$, $w↓2$, $w↓3$
of the polynomial
$x(u)y(u)\mod(u↑4-1)=w↓0+w↓1u+w↓2u↑2+w↓3u↑3$, our problem would be solved,
since $$u↑4\biglp x(u)y(u)\mod(u↑4-1)\bigrp\mod(u↑5-u)=w↓0u↑4+w↓1u+w↓2u↑2+w↓3u↑3,$$
and the combination of (56) and (59) would reduce to
$$x(u)y(u)=x↓0y↓0+(w↓1-x↓2y↓3)u+w↓2u↑2+w↓3u↑3+(w↓0-x↓0y↓0)u↑4+x↓2y↓3u↑5.\eqno(60)$$
(This formula can, of course, be verified directly.)
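
Such a verification might look as follows in Python (an editorial sketch; here the $w$'s are obtained by folding the straightforward product modulo $u↑4-1$, i.e., by adding the coefficient of $u↑{4+k}$ to that of $u↑k$):

def poly_mult(x, y):
    # coefficients of x(u)y(u), the straightforward way
    z = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            z[i + j] += xi * yj
    return z

x = [3, -1, 4]                    # x_0 + x_1 u + x_2 u^2
y = [2, 7, 0, -5]                 # y_0 + y_1 u + y_2 u^2 + y_3 u^3
z = poly_mult(x, y)

w = [z[k] + (z[k + 4] if k + 4 < len(z) else 0) for k in range(4)]   # x(u)y(u) mod (u^4 - 1)

rhs = [x[0]*y[0], w[1] - x[2]*y[3], w[2], w[3], w[0] - x[0]*y[0], x[2]*y[3]]   # formula (60)
assert rhs == z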

The problem remaining to be solved is to compute $x(u)y(u)\mod(u↑4-1)$; and this
subproblem is interesting in itself. Let us momentarily allow $x(u)$ to be of
degree 3 instead of degree 2. Then the coefficients of $x(u)y(u)\mod(u↑4-1)$ are
respectively
$$\twoline{x↓0y↓0+x↓1y↓3+x↓2y↓2+x↓3y↓1,\quad
x↓0y↓1+x↓1y↓0+x↓2y↓3+x↓3y↓2,}{3pt}{x↓0y↓2
+x↓1y↓1+x↓2y↓0+x↓3y↓3,\quad x↓0y↓3+x↓1y↓2+x↓2y↓1+x↓3y↓0,}$$
and the corresponding tensor is
$$\def\\#1.#2.#3.#4.{\left(\vcenter{\baselineskip 10pt
\halign{##\cr#1\cr#2\cr#3\cr#4\cr}}\right)}
\\1 0 0 0.0 0 0 1.0 0 1 0.0 1 0 0.
\\0 1 0 0.1 0 0 0.0 0 0 1.0 0 1 0.
\\0 0 1 0.0 1 0 0.1 0 0 0.0 0 0 1.
\\0 0 0 1.0 0 1 0.0 1 0 0.1 0 0 0..\eqno(61)$$
In general when deg$(x)=\hjust{deg}(y)=n-1$, the coefficients of $x(u)y(u)\mod
(u↑n-1)$ are called the {\sl cyclic convolution} of $(x↓0,x↓1,\ldotss,x↓{n-1})$
and $(y↓0,y↓1,\ldotss,y↓{n-1})$. The $k$th coefficient $w↓k$ is the bilinear form
$\sum x↓{i\,}y↓j$ summed over all $i$ and $j$ with $i+j≡k\modulo n$.
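
In code (an editorial sketch), the cyclic convolution is simply

def cyclic_convolution(x, y):
    # w_k = sum of x_i y_j over all i, j with i + j = k (mod n)
    n = len(x)
    w = [0] * n
    for i in range(n):
        for j in range(n):
            w[(i + j) % n] += x[i] * y[j]
    return w

assert cyclic_convolution([1, 2, 3, 4], [5, 6, 7, 8]) == [66, 68, 66, 60]

and the point of the developments that follow is to obtain the same result with fewer chain multiplications.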

The cyclic convolution of degree 4 can be obtained by applying rule (57). The
first step is to find the factors of $u↑4-1$, namely $(u-1)(u+1)(u↑2+1)$. We could
write this as $(u↑2-1)(u↑2+1)$, then apply rule (57), then use (57) again on
the part modulo $(u↑2-1)=(u+1)(u-1)$; but it is easier to generalize the
Chinese remainder rule (57) directly to the case of several relatively prime
factors, e.g.,
$$\baselineskip15pt
\vjust{\halign to size{$#$\cr
x(u)y(u)\mod q↓1(u)q↓2(u)q↓3(u)\hfill\cr
\quad=\midlp a↓1(u)q↓2(u)q↓3(u)\biglp
x(u)y(u)\mod q↓1(u)\bigrp+a↓2(u)q↓1(u)q↓3(u)\biglp x(u)y(u)\mod q↓2(u)\bigrp\cr
\hfill\null+a↓3(u)q↓1(u)q↓2(u)\biglp x(u)y(u)\mod q↓3(u)\bigrp\midrp
\mod q↓1(u)q↓2(u)q↓3(u),\quad(62)\cr}}$$
where $a↓1(u)q↓2(u)q↓3(u)+a↓2(u)q↓1(u)q↓3(u)+a↓3(u)q↓1(u)q↓2(u)=1$.\xskip
$\biglp$The latter equation can be understood in another way, by noting that
$a↓1(u)/q↓1(u)+a↓2(u)/q↓2(u)+a↓3(u)/q↓3(u)$ is the partial fraction expansion
of $1/q↓1(u)q↓2(u)q↓3(u)$. When each of the $q$'s is a linear polynomial
$u-α↓i$, the generalized Chinese remainder rule reduces to ordinary
interpolation as in Eq.\ (41), since $f(u)\mod(u-α↓i)=f(α↓i).\bigrp$\xskip
From (62) we obtain
$$\twoline{\hskip-10pt\textstyle
x(u)y(u)\mod(u↑4-1)=\midlp{u↑3+u↑2+u+1\over4}x(1)y(1)-
{u↑3-u↑2+u-1\over4}x(-1)y(-1)}{4pt}{\textstyle
\null-{u↑2-1\over2}\biglp x(u)y(u)\mod
(u↑2+1)\bigrp\midrp\mod(u↑4-1).\quad(63)\hskip-10pt}$$